Lesson 5. Data-driven Mapping

Data-driven mapping refers to the process of using data values to determine the symbology of mapped features. Color, shape, and size are the three most common graphic elements used to symbolize data-driven maps. Data-driven maps are often referred to as thematic maps.


Instructor Notes

Types of Thematic Maps

There are two primary types of thematic maps:

  • Choropleth maps: set the color of areas (polygons) by data value

  • Point symbol maps: set the color or size of points by data value

We review both of these types of maps in more detail in this lesson. First, let’s take a quick look at choropleth maps.

library(sf)
library(tmap)

5.1 Choropleth Maps

Choropleth maps are the most common type of thematic map.

Let’s use an sf data.frame of counties data to make a choropleth map.

First, read in the counties data with the st_read function.

counties = st_read('notebook_data/california_counties/CaliforniaCounties.shp')
## Reading layer `CaliforniaCounties' from data source `/Users/pattyf/Documents/Dlab/workshops/2021/Geospatial-Fundamentals-in-R-with-sf/notebook_data/california_counties/CaliforniaCounties.shp' using driver `ESRI Shapefile'
## Simple feature collection with 58 features and 23 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -374445.4 ymin: -604500.7 xmax: 540038.5 ymax: 450022
## Projected CRS: NAD83 / California Albers

Then, make a map of our counties.

plot(counties$geometry)

Now, take a look at the spatial dataframe.

head(counties)
## Simple feature collection with 6 features and 23 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -267387.9 ymin: -578013.2 xmax: 216677.6 ymax: 352693.6
## Projected CRS: NAD83 / California Albers
##          NAME STATE_NAME POP2012  POP12_SQMI   WHITE  BLACK AMERI_ES   ASIAN
## 1        Kern California  851089  104.282870  499766  48921    12676   34846
## 2       Kings California  155039  111.427421   83027  11014     2562    5620
## 3        Lake California   65253   49.082334   52033   1232     2049     724
## 4      Lassen California   35039    7.422856   25532   2834     1234     356
## 5 Los Angeles California 9904341 2423.264150 4936599 856874    72828 1346865
## 6      Madera California  153025   71.065672   94456   5629     4136    2802
##   HAWN_PI HISPANIC   OTHER MULT_RACE   MALES FEMALES MED_AGE HOUSEHOLDS
## 1    1252   413033  204314     37856  433108  406523    30.7     254610
## 2     271    77866   42996      7492   86344   66638    31.1      41233
## 3     108    11088    5455      3064   32469   32196    45.0      26548
## 4     165     6117    3562      1212   22416   12479    37.0      10058
## 5   26094  4687889 2140632    438713 4839654 4978951    34.8    3241204
## 6     162    80992   37380      6300   72682   78183    33.1      43317
##   FAMILIES HSE_UNITS AVE_FAM_SZ VACANT OWNER_OCC RENTER_OCC CountyFIPS
## 1   191739    284367       3.61  29757    152828     101782      06103
## 2    31939     43867       3.59   2634     22329      18904      06089
## 3    16255     35492       2.94   8944     17472       9076      06106
## 4     6800     12710       2.98   2652      6590       3468      06086
## 5  2194080   3445076       3.58 203872   1544749    1696455      06073
## 6    34093     49140       3.63   5823     27726      15591      06102
##                         geometry
## 1 MULTIPOLYGON (((213672.6 -2...
## 2 MULTIPOLYGON (((12524.03 -1...
## 3 MULTIPOLYGON (((-235734.3 1...
## 4 MULTIPOLYGON (((12.28914 35...
## 5 MULTIPOLYGON (((173874.5 -4...
## 6 MULTIPOLYGON (((16681.16 -1...

In particular, we are interested in the columns with numeric values as these are the ones typically used to make data maps.

To get started, let’s create a choropleth map by setting the color of each county based on the value in the population per square mile column (POP12_SQMI).

Recall that sf’s plot method does this by default! So, here’s the quickest way to make a choropleth:

plot(counties['POP12_SQMI'])

By default, sf::plot linearly scales the colors to the data values. This is called a proportional color map.

Choropleth mapping with tmap

We can also use tmap to create thematic maps. This package gives us greater control over the visualization details.

In tmap, instead of setting the col argument to the same static value (e.g. ‘red’, ‘#ef03a5’) for all features, we can set it to the name of the column by which we want our polygons colored (e.g. ‘POP12_SQMI’).

# Set the mapping mode to a static plot (not interactive)
tmap_mode('plot')  
## tmap mode set to plotting
# Map the county polygons colored by the values in the POP12_SQMI column
tm_shape(counties) + 
  tm_polygons(col='POP12_SQMI',
              title = "Population Density per mi^2")

By default, tmap uses a yellow-orange-brown (YlOrBr) sequential color palette for thematic maps and bins those colors into 3 to 7 classes of approximately equal intervals with rounded values for class breaks.

Of course, we can also use tmap’s interactive mapping mode. Do you recall the syntax for:

  • setting the tmap mode to static vs interactive mapping?

  • or toggling between these two modes?

Let’s make an interactive map, making our layer partially transparent, i.e. alpha = 0.4, so that we can see the basemap through our polygons.

  • This transparency may be more or less noticeable depending on the selected basemap!
tmap_mode('view')
## tmap mode set to interactive viewing
tm_shape(counties) +
  tm_polygons(col='POP12_SQMI', alpha=0.5,
              title = "Population Density per mi^2")

That’s really the heart of of creating a choropleth map with tmap. To set the color of the features based on the values in a column, set the col argument to the column name in the sf data.frame (cast as a string!).

Practice

Redo the map above, but mapping population (POP2012) NOT population density.

# Map of County Population

Question

What map better conveys county population - POP12_SQMI or POP2012?

The Challenge of Thematic Maps

The goal of a thematic map is to use color to visualize the spatial distribution of a variable.

Another goal is to use color to effectively and quickly convey information. For example,

  • maps use brighter or richer colors to signify higher values,

  • and leverage cognitive associations such as mapping water with the color blue.

There are two major challenges when creating thematic maps:

  1. Our eyes are drawn to the color of larger areas or linear features, even if the values of smaller features are more significant.

  2. The range of data values is rarely evenly distributed across all observations and thus the colors can be misleading.

Questions

  • Do you see this either of these problems in our population-density map?

    • Take a look at the histogram below as you consider the above question.
hist(counties$POP12_SQMI,breaks=40, main = 'Population Density per mi^2')

There are three main techniques for dealing with these mapping challenges:

  1. Color palettes

  2. Data transformations

  3. Classification schemes

5.2 Color Palettes

There are three main types of color palettes (or color maps), each of which has a different purpose:

  • diverging - a “diverging” set of colors are used so emphasize mid-range values as well as extremes.

  • sequential - usually with a single or multi color hue to emphasize differences in order and magnitude, where darker colors typically mean higher values

  • qualitative - a contrasting set of colors to identify distinct categories and avoid implying quantitative significance.

Tip: Sites like ColorBrewer let’s you play around with different types of color maps.

To see the names of all color palettes avaialble to tmap, try the following command. You may need to enlarge the output image.

RColorBrewer::display.brewer.all()

As a best practice, a qualitative color palette should not be used with quantitative data and vice versa. For example, consider this map that EDM.com published of top dance tracks by state.

5.3 Transforming Count Data

For a number of reasons, data are often distributed in aggregated form. For example, the Census Bureau collects data from individual people, households and businesses and distributes it aggregated to states, counties, and census tracts, etc.

When the aggregated data are counts, like total population, they can be transformed to densities, proportions and ratios. These normalized variables are more comparable across regions that differ greatly in size.

Let’s consider this in terms of our data.

  • Counts
    • data counts, aggregated by feature
      • e.g. population within a county
  • Densities
    • counts aggregated by feature and normalized by feature area
      • e.g. population per square mile within a county
  • Proportions / Percentages
    • value in a specific category divided by total value across in all categories
      • e.g. proportion of the county population that is white compared to the total county population
  • Rates / Ratios
    • value in one category divided by value in another category
      • e.g. homeowner-to-renter ratio would be calculated as the number of homeowners (c_owners/ c_renters)

The basic cartographic rule is that when mapping areas that differ in size you never map counts since those differences in size make the comparison less invalid.

5.4 Classification schemes

Another way to make more meaningful maps is to improve the way in which data values are mapped to colors.

The common alternative to a proportional color map is to use a classification scheme to create a graduated color map. This is the standard way to create a choropleth map.

A classification scheme is a method for binning continuous data values into 4-7 classes (the default is 5) and map those classes to a color palette.

The commonly used classifications schemes:

  • Equal intervals or Pretty
    • equal-size data ranges (e.g., values within 0-10, 10-20, 20-30, etc.)
    • pros:
      • best for data spread across entire range of values
      • easily understood by map readers
    • cons:
      • but avoid if you have highly skewed data or a few big outliers
  • Quantiles
    • equal number of observations in each bin
    • pros:
      • looks nice, becuase it best spreads colors across full set of data values
      • thus, it’s often the default scheme for mapping software
    • cons:
      • bin ranges based on the number of observations, not on the data values
      • thus, different classes can have very similar or very different values.
  • Natural breaks
    • minimize within-class variance and maximize between-class differences
    • e.g. ‘fisher-jenks’,
    • pros:
      • great for exploratory data analysis, because it can identify natural groupings
    • cons:
      • class breaks are best fit to one dataset, so the same bins can’t always be used for multiple years
  • Head/Tails
    • a new relatively new scheme for data with a heavy-tailed distribution
  • Manual
    • classifications are user-defined
    • pros:
      • especially useful if you want to slightly change the breaks produced by another scheme
      • can be used as a fixed set of breaks to compare data over time
    • cons:
      • more work involved

Classification schemes and tmap

Classification schemes can be implemented using the tmap geometry functions (tm_polygons, tm_dots, etc.) by setting a value for the style argument.

Here are some of the tmap keyword names for classification styles that we can use (from the docs: ?tm_polygons):

  • equal, quantile,fisher, jenks, headtails, fixed, kmeans, pretty.

For more information about these classification schemes see ?classIntervals or sources such as this page in the Lovelace, Nowosad, and Muenchow ebook.


Classification schemes in action

Let’s redo the last map using the quantile classification scheme.

  • What is different about the code? About the output map?
tmap_mode('plot')
## tmap mode set to plotting
# Plot population density - mile^2
tm_shape(counties) + 
  tm_polygons(col = 'POP12_SQMI',
              style="quantile",
              alpha=0.5,
              title="Population Density per mi^2")

Practice

Redo the previous map with these classification schemes: headtails, equal, jenks

  • Which one do you like best?

User Defined Classification Schemes

You may get pretty close to your final map without being completely satisfied. In this case you can manually define a classification scheme.

Let’s customize our map with a user-defined classification scheme where we manually set the breaks for the bins using the classification_kwds argument.

tm_shape(counties) + 
  tm_polygons(col = 'POP12_SQMI',
              palette = "YlGn", 
              style='fixed',
              breaks = c(0, 50, 100, 200, 300, 400, max(counties$POP12_SQMI)),
              title = "Population Density per Sq Mile")

Since we are customizing our plot, we can also edit our legend to specify the text, so that it’s easier to read.

  • We’ll use tm_add_legend to build our own customized legend.
tm_shape(counties) + 
  tm_polygons(col = 'POP12_SQMI',
              palette = "YlGn", 
              style='fixed',
              breaks = c(0, 50, 100, 200, 300, 400, max(counties$POP12_SQMI)),
              legend.show = F) +
tm_add_legend('fill', col = RColorBrewer::brewer.pal(6, "YlGn"),
              border.col = "black",
              title = "Population Density per Sq Mile",
              labels = c('<50','50 to 100','100 to 200','200 to 300','300 to 400','>400'))

Let’s plot a ratio

If we look at the columns in our dataset, we see we have a number of variables from which we can calculate proportions, rates, and the like.

Let’s try that out:

head(counties)
## Simple feature collection with 6 features and 23 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -267387.9 ymin: -578013.2 xmax: 216677.6 ymax: 352693.6
## Projected CRS: NAD83 / California Albers
##          NAME STATE_NAME POP2012  POP12_SQMI   WHITE  BLACK AMERI_ES   ASIAN
## 1        Kern California  851089  104.282870  499766  48921    12676   34846
## 2       Kings California  155039  111.427421   83027  11014     2562    5620
## 3        Lake California   65253   49.082334   52033   1232     2049     724
## 4      Lassen California   35039    7.422856   25532   2834     1234     356
## 5 Los Angeles California 9904341 2423.264150 4936599 856874    72828 1346865
## 6      Madera California  153025   71.065672   94456   5629     4136    2802
##   HAWN_PI HISPANIC   OTHER MULT_RACE   MALES FEMALES MED_AGE HOUSEHOLDS
## 1    1252   413033  204314     37856  433108  406523    30.7     254610
## 2     271    77866   42996      7492   86344   66638    31.1      41233
## 3     108    11088    5455      3064   32469   32196    45.0      26548
## 4     165     6117    3562      1212   22416   12479    37.0      10058
## 5   26094  4687889 2140632    438713 4839654 4978951    34.8    3241204
## 6     162    80992   37380      6300   72682   78183    33.1      43317
##   FAMILIES HSE_UNITS AVE_FAM_SZ VACANT OWNER_OCC RENTER_OCC CountyFIPS
## 1   191739    284367       3.61  29757    152828     101782      06103
## 2    31939     43867       3.59   2634     22329      18904      06089
## 3    16255     35492       2.94   8944     17472       9076      06106
## 4     6800     12710       2.98   2652      6590       3468      06086
## 5  2194080   3445076       3.58 203872   1544749    1696455      06073
## 6    34093     49140       3.63   5823     27726      15591      06102
##                         geometry
## 1 MULTIPOLYGON (((213672.6 -2...
## 2 MULTIPOLYGON (((12524.03 -1...
## 3 MULTIPOLYGON (((-235734.3 1...
## 4 MULTIPOLYGON (((12.28914 35...
## 5 MULTIPOLYGON (((173874.5 -4...
## 6 MULTIPOLYGON (((16681.16 -1...

Let’s calculate the percent of the population that is hispanic and save it to a new column. Then, we can use that to create a choropleth map.

# calculate percent hispanic as a new column
counties$pct_hispanic = counties$HISPANIC/counties$POP2012 * 100

# Plot percent hispanic as choropleth
tm_shape(counties) + 
  tm_polygons(col = 'pct_hispanic',
              palette = 'Blues', 
              style = 'fixed',
              breaks= c(0,20,40,60,80,100),
              border.col = "darkgrey",
              lwd = 1.5,
              legend.show=F) + 
tm_add_legend('fill', col = RColorBrewer::brewer.pal(5, "Blues"),
              border.col = "darkgrey",
              title = "Percent Hispanic Population",
              labels = c('<20%','20% - 40%','40% - 60%','60% - 80%','80% - 100%'))

Question

  1. What new options and operations have we added to our code?

  2. How many values do we specify in the breaks vector, and how many bins are in the map legend? Why?

5.5 Point Maps

Choropleth maps are great, but point maps enable us to visualize our spatial data in another way.

If you know both mapping methods you can expand how much information you can show in one map.

For example, point maps are a great way to map counts because the varying sizes of areas are deemphasized.

The tm_dot element makes it easy to create point maps dynamically from polygon data!

# County population counts as a point map!
tmap_mode('plot')
## tmap mode set to plotting
# Add the county polygon borders as a basemap
tm_shape(counties) + 
  tm_borders(col="grey") +
  
# Then map the county centroids as points colored by population counts
  tm_shape(counties) + 
  tm_dots(col = 'POP2012',
              palette = 'YlOrRd', 
              style = 'jenks',
              border.col = "black",  # dot borders only visible in interactive mode!
              border.lwd = 1,
              border.alpha=1,
              size=.5,
              legend.show=T) 

This is another useful type of data transformation for making effective maps.

More Point Data Maps

Let’s read in some data that is more typically encoded with point geometry - Alameda County schools.

schools_df = read.csv('notebook_data/alco_schools.csv')
head(schools_df)
##           X        Y                      Site               Address    City
## 1 -122.2388 37.74476 Amelia Earhart Elementary 400 Packet Landing Rd Alameda
## 2 -122.2519 37.73900       Bay Farm Elementary   200 Aughinbaugh Way Alameda
## 3 -122.2589 37.76206  Donald D. Lum Elementary    1801 Sandcreek Way Alameda
## 4 -122.2348 37.76525         Edison Elementary  2700 Buena Vista Ave Alameda
## 5 -122.2381 37.75396     Frank Otis Elementary      3010 Fillmore St Alameda
## 6 -122.2616 37.76911       Franklin Elementary  1433 San Antonio Ave Alameda
##   State Type API    Org
## 1    CA   ES 933 Public
## 2    CA   ES 932 Public
## 3    CA   ES 853 Public
## 4    CA   ES 927 Public
## 5    CA   ES 894 Public
## 6    CA   ES 893 Public

We got it from a plain CSV file, let’s promote it to an sf data.frame.

schools_sf = st_as_sf(schools_df, 
                      coords = c('X','Y'),
                      crs = 4326)

Then we can map it.

plot(schools_sf)

What is useful about the above display of the maps for each column in the dataframe is that at a glance you can see the type of data variable and get a sense of the range of values.

The default sf::plot point map for a numeric data column is a proportional color map that linearly scales the color of the point symbol by the data values.

# Point map of API - Academic Performance Index
plot(schools_sf['API'])

Point maps with tmap

Let’s try creating the same map with tmap.

tmap_mode('plot')
## tmap mode set to plotting
tm_shape(schools_sf) + 
  tm_dots(col="API")

The basic tmap graduated color map needs some customization to shine, especially in plot mode!

By default, tmap uses a yellow-orange-brown (YlOrBr) sequential color palette and the pretty classification scheme for point thematic maps. These are the same defaults that are used for tmap choropleth maps. But point maps that symbolize data values by color are called Graduated Color Maps. In spite of the different map names, the color and classification scheme options are almost identical in tmap! However, some options will be different - for example, a size parameter makes sense for a point radius but not a polygon!

See ?tm_dot for more information about the options for customizing point maps! For example…

# API Graduated Color Map
tm_shape(schools_sf) +
  tm_dots(col='API', 
          size=0.15,
          palette='Reds', 
          style='fixed',
          breaks=c(0, 200, 400, 600, 800, 1000),
          border.col='grey',
          legend.show=F) + 
  
tm_add_legend('fill', title='Alameda County, school API scores',
              labels = c('<200', '[200,400)', '[400,600)', '[600,800)', '>800'),
              col = RColorBrewer::brewer.pal(5, "Reds")) +
  
tm_layout(legend.position = c('right','top'))

Proportional Symbol Maps

Another important type of point map is the proportional symbol map. These are like proportial color maps but instead of associating symbol color with data values they associate symbol size. You can make these in tmap with the tm_bubbles function.

The schools data does not contain any good variables for proportional symbol mapping so we will read in a supplemental file of NCES data and join it to the school points.

df = read.csv('notebook_data/other/PolicyMap_NCES_Data_20210429.csv')
#head(df,2)
df2 = df[c('School.Name','Student.Teacher.Ratio','Free.and.Reduced.price.Lunch.Eligible.Students')]
colnames(df2) <- c('Site','STRatio','RLunch')
#head(df2,2)
schools_sf2 <- merge(schools_sf, df2, by="Site")
#head(schools_sf2,2)
tmap_mode('plot')
## tmap mode set to plotting
tm_shape(schools_sf2) + 
  tm_bubbles(size="RLunch", 
             col="pink", 
             border.col='black', 
             title.size="Students Eligible for Free/Reduced Lunch" ) +
  tm_layout( legend.position = c('right','top'))

5.5 Mapping Categorical Data

Mapping categorical data, also called qualitative data, is a bit more straightforward. There is no need to scale or classify data values. The goal of the color map is to provide a contrasting set of colors so as to clearly delineate different categories. Here’s a point-based example:

tm_shape(schools_sf) + 
  tm_dots(col='Org', size=0.15, palette='Spectral', title="School Type")

5.6 Recap

We learned about important data driven mapping strategies and mapping concepts, including:

  • Choropleth Maps
  • Color Palettes
  • Classification Schemes
  • Point maps

Point and polygons are not the only geometry-types that we can use in data-driven mapping! You can also map linear features by associating data values with the color, shape and size of features. But these types of maps are less common.

Exercise: Data-Driven Mapping

Practice creating choropleth and graduated color maps with the counties data. Pick one quantitative variable like MED_AGE and try different color palettes and classification schemes.


 D-Lab @ University of California - Berkeley
 Team Geo